New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Fast FBGEMM path KT.regroup_as #1910

Closed

dstaay-fb wants to merge 1 commit into pytorch:main from dstaay-fb:export-D56392296

Contributor

dstaay-fb commented Apr 22, 2024

Summary:
Use custom FBGEMM kernel when possible for inference/training. ~0-75% runtime speedup.

Benchmark Results [Forward]
[fallback] _regroup_keyed_tenors | B: 512 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 24.0
[prod] KeyedTensor.regroup | B: 512 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 36.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 48.0
[prod] KeyedTensor.regroup | B: 512 | F: 160 | device: cuda | Runtime (P90): 0.6 ms | Memory (P90): 72.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 320 | device: cuda | Runtime (P90): 1.9 ms | Memory (P90): 96.0
[prod] KeyedTensor.regroup | B: 512 | F: 320 | device: cuda | Runtime (P90): 0.7 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 640 | device: cuda | Runtime (P90): 4.6 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 512 | F: 640 | device: cuda | Runtime (P90): 1.3 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 1280 | device: cuda | Runtime (P90): 13.2 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 512 | F: 1280 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 80 | device: cuda | Runtime (P90): 0.3 ms | Memory (P90): 48.0
[prod] KeyedTensor.regroup | B: 1024 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 72.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 96.0
[prod] KeyedTensor.regroup | B: 1024 | F: 160 | device: cuda | Runtime (P90): 0.6 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 320 | device: cuda | Runtime (P90): 1.8 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 1024 | F: 320 | device: cuda | Runtime (P90): 0.9 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 640 | device: cuda | Runtime (P90): 4.1 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 1024 | F: 640 | device: cuda | Runtime (P90): 1.6 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 12.8 ms | Memory (P90): 768.0
[prod] KeyedTensor.regroup | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 3.1 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 96.0
[prod] KeyedTensor.regroup | B: 2048 | F: 80 | device: cuda | Runtime (P90): 0.5 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 160 | device: cuda | Runtime (P90): 0.7 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 2048 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 320 | device: cuda | Runtime (P90): 1.6 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 2048 | F: 320 | device: cuda | Runtime (P90): 1.4 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 640 | device: cuda | Runtime (P90): 4.8 ms | Memory (P90): 768.0
[prod] KeyedTensor.regroup | B: 2048 | F: 640 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 12.5 ms | Memory (P90): 1536.0
[prod] KeyedTensor.regroup | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 5.6 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 4096 | F: 80 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 160 | device: cuda | Runtime (P90): 0.9 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 4096 | F: 160 | device: cuda | Runtime (P90): 1.4 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 320 | device: cuda | Runtime (P90): 1.7 ms | Memory (P90): 768.0
[prod] KeyedTensor.regroup | B: 4096 | F: 320 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 640 | device: cuda | Runtime (P90): 4.1 ms | Memory (P90): 1536.0
[prod] KeyedTensor.regroup | B: 4096 | F: 640 | device: cuda | Runtime (P90): 5.6 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 12.2 ms | Memory (P90): 3072.0
[prod] KeyedTensor.regroup | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 11.1 ms | Memory (P90): 4608.0

Benchmark Results [Fowrard + Backward]
[prod] KeyedTensor.regroup | B: 512 | F: 80 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 72.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 160 | device: cuda | Runtime (P90): 4.7 ms | Memory (P90): 144.0
[prod] KeyedTensor.regroup | B: 512 | F: 160 | device: cuda | Runtime (P90): 3.4 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 320 | device: cuda | Runtime (P90): 9.0 ms | Memory (P90): 288.0
[prod] KeyedTensor.regroup | B: 512 | F: 320 | device: cuda | Runtime (P90): 6.5 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 640 | device: cuda | Runtime (P90): 19.9 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 512 | F: 640 | device: cuda | Runtime (P90): 11.4 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 1280 | device: cuda | Runtime (P90): 46.7 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 512 | F: 1280 | device: cuda | Runtime (P90): 23.1 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 80 | device: cuda | Runtime (P90): 2.6 ms | Memory (P90): 144.0
[prod] KeyedTensor.regroup | B: 1024 | F: 80 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 160 | device: cuda | Runtime (P90): 4.5 ms | Memory (P90): 288.0
[prod] KeyedTensor.regroup | B: 1024 | F: 160 | device: cuda | Runtime (P90): 3.9 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 320 | device: cuda | Runtime (P90): 8.8 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 1024 | F: 320 | device: cuda | Runtime (P90): 6.7 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 640 | device: cuda | Runtime (P90): 18.7 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 1024 | F: 640 | device: cuda | Runtime (P90): 12.2 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 42.8 ms | Memory (P90): 2304.0
[prod] KeyedTensor.regroup | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 23.1 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 80 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 288.0
[prod] KeyedTensor.regroup | B: 2048 | F: 80 | device: cuda | Runtime (P90): 2.4 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 160 | device: cuda | Runtime (P90): 4.5 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 2048 | F: 160 | device: cuda | Runtime (P90): 4.2 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 320 | device: cuda | Runtime (P90): 8.9 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 2048 | F: 320 | device: cuda | Runtime (P90): 7.7 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 640 | device: cuda | Runtime (P90): 19.2 ms | Memory (P90): 2304.0
[prod] KeyedTensor.regroup | B: 2048 | F: 640 | device: cuda | Runtime (P90): 12.9 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 45.1 ms | Memory (P90): 4608.0
[prod] KeyedTensor.regroup | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 26.4 ms | Memory (P90): 4608.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 80 | device: cuda | Runtime (P90): 2.4 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 4096 | F: 80 | device: cuda | Runtime (P90): 2.7 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 160 | device: cuda | Runtime (P90): 4.4 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 4096 | F: 160 | device: cuda | Runtime (P90): 4.4 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 320 | device: cuda | Runtime (P90): 8.4 ms | Memory (P90): 2304.0
[prod] KeyedTensor.regroup | B: 4096 | F: 320 | device: cuda | Runtime (P90): 8.1 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 640 | device: cuda | Runtime (P90): 28.0 ms | Memory (P90): 4608.0
[prod] KeyedTensor.regroup | B: 4096 | F: 640 | device: cuda | Runtime (P90): 15.6 ms | Memory (P90): 4608.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 43.2 ms | Memory (P90): 9216.0
[prod] KeyedTensor.regroup | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 31.2 ms | Memory (P90): 9216.0

Differential Revision: D56392296

facebook-github-bot added the CLA Signed label

Contributor

facebook-github-bot commented Apr 22, 2024

This pull request was exported from Phabricator. Differential Revision: D56392296

facebook-github-bot added the fb-exported label

dstaay-fb added a commit to dstaay-fb/torchrec that referenced this pull request


          Fast FBGEMM path KT.regroup_as (pytorch#1910)

42b8fab

Summary:

Use custom FBGEMM kernel when possible for inference/training.  ~0-75% runtime speedup.

Benchmark Results [Forward]
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  24.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  36.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 160      | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90):  48.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 160      | device: cuda     | Runtime (P90):   0.6 ms | Memory (P90):  72.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 320      | device: cuda     | Runtime (P90):   1.9 ms | Memory (P90):  96.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 320      | device: cuda     | Runtime (P90):   0.7 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 640      | device: cuda     | Runtime (P90):   4.6 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 640      | device: cuda     | Runtime (P90):   1.3 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  13.2 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 1280     | device: cuda     | Runtime (P90):   2.2 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   0.3 ms | Memory (P90):  48.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  72.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90):  96.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   0.6 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   1.8 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   0.9 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 640      | device: cuda     | Runtime (P90):   4.1 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 640      | device: cuda     | Runtime (P90):   1.6 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  12.8 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):   3.1 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  96.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   0.5 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   0.7 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   1.6 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   1.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 640      | device: cuda     | Runtime (P90):   4.8 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 640      | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  12.5 ms | Memory (P90): 1536.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):   5.6 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   0.9 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   1.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   1.7 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 640      | device: cuda     | Runtime (P90):   4.1 ms | Memory (P90): 1536.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 640      | device: cuda     | Runtime (P90):   5.6 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  12.2 ms | Memory (P90): 3072.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  11.1 ms | Memory (P90): 4608.0

Benchmark Results [Fowrard + Backward]
  [prod] KeyedTensor.regroup          | B: 512      | F: 80       | device: cuda     | Runtime (P90):   2.2 ms | Memory (P90):  72.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 160      | device: cuda     | Runtime (P90):   4.7 ms | Memory (P90): 144.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 160      | device: cuda     | Runtime (P90):   3.4 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 320      | device: cuda     | Runtime (P90):   9.0 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 320      | device: cuda     | Runtime (P90):   6.5 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 640      | device: cuda     | Runtime (P90):  19.9 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 640      | device: cuda     | Runtime (P90):  11.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  46.7 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  23.1 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   2.6 ms | Memory (P90): 144.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   4.5 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   3.9 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   8.8 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   6.7 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 640      | device: cuda     | Runtime (P90):  18.7 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 640      | device: cuda     | Runtime (P90):  12.2 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  42.8 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  23.1 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   2.4 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   4.5 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   4.2 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   8.9 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   7.7 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 640      | device: cuda     | Runtime (P90):  19.2 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 640      | device: cuda     | Runtime (P90):  12.9 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  45.1 ms | Memory (P90): 4608.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  26.4 ms | Memory (P90): 4608.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   2.4 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   2.7 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   4.4 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   4.4 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   8.4 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   8.1 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 640      | device: cuda     | Runtime (P90):  28.0 ms | Memory (P90): 4608.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 640      | device: cuda     | Runtime (P90):  15.6 ms | Memory (P90): 4608.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  43.2 ms | Memory (P90): 9216.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  31.2 ms | Memory (P90): 9216.0

Differential Revision: D56392296

dstaay-fb force-pushed the export-D56392296 branch from e2c6408 to 42b8fab Compare

April 23, 2024 00:18

Contributor

facebook-github-bot commented Apr 23, 2024

This pull request was exported from Phabricator. Differential Revision: D56392296

dstaay-fb added a commit to dstaay-fb/torchrec that referenced this pull request


          Fast FBGEMM path KT.regroup_as (pytorch#1910)

d172d1e

Summary:

Use custom FBGEMM kernel when possible for inference/training.  ~0-75% runtime speedup.

Benchmark Results [Forward]
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  24.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  36.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 160      | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90):  48.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 160      | device: cuda     | Runtime (P90):   0.6 ms | Memory (P90):  72.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 320      | device: cuda     | Runtime (P90):   1.9 ms | Memory (P90):  96.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 320      | device: cuda     | Runtime (P90):   0.7 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 640      | device: cuda     | Runtime (P90):   4.6 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 640      | device: cuda     | Runtime (P90):   1.3 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  13.2 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 1280     | device: cuda     | Runtime (P90):   2.2 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   0.3 ms | Memory (P90):  48.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  72.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90):  96.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   0.6 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   1.8 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   0.9 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 640      | device: cuda     | Runtime (P90):   4.1 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 640      | device: cuda     | Runtime (P90):   1.6 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  12.8 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):   3.1 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  96.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   0.5 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   0.7 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   1.6 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   1.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 640      | device: cuda     | Runtime (P90):   4.8 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 640      | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  12.5 ms | Memory (P90): 1536.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):   5.6 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   0.9 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   1.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   1.7 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 640      | device: cuda     | Runtime (P90):   4.1 ms | Memory (P90): 1536.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 640      | device: cuda     | Runtime (P90):   5.6 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  12.2 ms | Memory (P90): 3072.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  11.1 ms | Memory (P90): 4608.0

Benchmark Results [Fowrard + Backward]
  [prod] KeyedTensor.regroup          | B: 512      | F: 80       | device: cuda     | Runtime (P90):   2.2 ms | Memory (P90):  72.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 160      | device: cuda     | Runtime (P90):   4.7 ms | Memory (P90): 144.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 160      | device: cuda     | Runtime (P90):   3.4 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 320      | device: cuda     | Runtime (P90):   9.0 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 320      | device: cuda     | Runtime (P90):   6.5 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 640      | device: cuda     | Runtime (P90):  19.9 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 640      | device: cuda     | Runtime (P90):  11.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  46.7 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  23.1 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   2.6 ms | Memory (P90): 144.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   4.5 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   3.9 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   8.8 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   6.7 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 640      | device: cuda     | Runtime (P90):  18.7 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 640      | device: cuda     | Runtime (P90):  12.2 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  42.8 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  23.1 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   2.4 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   4.5 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   4.2 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   8.9 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   7.7 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 640      | device: cuda     | Runtime (P90):  19.2 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 640      | device: cuda     | Runtime (P90):  12.9 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  45.1 ms | Memory (P90): 4608.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  26.4 ms | Memory (P90): 4608.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   2.4 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   2.7 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   4.4 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   4.4 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   8.4 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   8.1 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 640      | device: cuda     | Runtime (P90):  28.0 ms | Memory (P90): 4608.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 640      | device: cuda     | Runtime (P90):  15.6 ms | Memory (P90): 4608.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  43.2 ms | Memory (P90): 9216.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  31.2 ms | Memory (P90): 9216.0

Differential Revision: D56392296

dstaay-fb force-pushed the export-D56392296 branch from 42b8fab to d172d1e Compare

April 23, 2024 00:57

Contributor

facebook-github-bot commented Apr 23, 2024

This pull request was exported from Phabricator. Differential Revision: D56392296


          Fast FBGEMM path KT.regroup_as (pytorch#1910)

12385ab

Summary:

Use custom FBGEMM kernel when possible for inference/training.  ~0-75% runtime speedup.

Benchmark Results [Forward]
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  24.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  36.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 160      | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90):  48.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 160      | device: cuda     | Runtime (P90):   0.6 ms | Memory (P90):  72.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 320      | device: cuda     | Runtime (P90):   1.9 ms | Memory (P90):  96.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 320      | device: cuda     | Runtime (P90):   0.7 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 640      | device: cuda     | Runtime (P90):   4.6 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 640      | device: cuda     | Runtime (P90):   1.3 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  13.2 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 1280     | device: cuda     | Runtime (P90):   2.2 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   0.3 ms | Memory (P90):  48.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  72.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90):  96.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   0.6 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   1.8 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   0.9 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 640      | device: cuda     | Runtime (P90):   4.1 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 640      | device: cuda     | Runtime (P90):   1.6 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  12.8 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):   3.1 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  96.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   0.5 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   0.7 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   1.6 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   1.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 640      | device: cuda     | Runtime (P90):   4.8 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 640      | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  12.5 ms | Memory (P90): 1536.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):   5.6 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   0.9 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   1.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   1.7 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 640      | device: cuda     | Runtime (P90):   4.1 ms | Memory (P90): 1536.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 640      | device: cuda     | Runtime (P90):   5.6 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  12.2 ms | Memory (P90): 3072.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  11.1 ms | Memory (P90): 4608.0

Benchmark Results [Fowrard + Backward]
  [prod] KeyedTensor.regroup          | B: 512      | F: 80       | device: cuda     | Runtime (P90):   2.2 ms | Memory (P90):  72.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 160      | device: cuda     | Runtime (P90):   4.7 ms | Memory (P90): 144.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 160      | device: cuda     | Runtime (P90):   3.4 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 320      | device: cuda     | Runtime (P90):   9.0 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 320      | device: cuda     | Runtime (P90):   6.5 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 640      | device: cuda     | Runtime (P90):  19.9 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 640      | device: cuda     | Runtime (P90):  11.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  46.7 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  23.1 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   2.6 ms | Memory (P90): 144.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   4.5 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   3.9 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   8.8 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   6.7 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 640      | device: cuda     | Runtime (P90):  18.7 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 640      | device: cuda     | Runtime (P90):  12.2 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  42.8 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  23.1 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   2.4 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   4.5 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   4.2 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   8.9 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   7.7 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 640      | device: cuda     | Runtime (P90):  19.2 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 640      | device: cuda     | Runtime (P90):  12.9 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  45.1 ms | Memory (P90): 4608.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  26.4 ms | Memory (P90): 4608.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   2.4 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   2.7 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   4.4 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   4.4 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   8.4 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   8.1 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 640      | device: cuda     | Runtime (P90):  28.0 ms | Memory (P90): 4608.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 640      | device: cuda     | Runtime (P90):  15.6 ms | Memory (P90): 4608.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  43.2 ms | Memory (P90): 9216.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  31.2 ms | Memory (P90): 9216.0

Reviewed By: PaulZhang12

Differential Revision: D56392296

dstaay-fb force-pushed the export-D56392296 branch from d172d1e to 12385ab Compare

April 24, 2024 01:56

Contributor

facebook-github-bot commented Apr 24, 2024

This pull request was exported from Phabricator. Differential Revision: D56392296

facebook-github-bot closed this in

76e854c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed fb-exported